AITopics | incorrect option

Collaborating Authors

incorrect option

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Neural Information Processing SystemsJun-22-2026, 20:58:33 GMT

constraint, large language model, machine learning, (20 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry:

Transportation > Passenger (0.45)
Transportation > Ground > Road (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

Xu, Rui, Lu, Dakuan, Zhao, Zicheng, Tan, Xiaoyu, Wang, Xintao, Yuan, Siyu, Chen, Jiangjie, Xu, Yinghui

arXiv.org Artificial IntelligenceNov-25-2025

Spatial reasoning is a key capability in the field of artificial intelligence, especially crucial in areas such as robotics, computer vision, and natural language understanding. However, evaluating the ability of multimodal large language models(MLLMs) in complex spatial reasoning still faces challenges, particularly in scenarios requiring multi-step reasoning and precise mathematical constraints. This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability and the capacity to handle mathematical constraints of MLLMs through origami tasks. The dataset contains 350 data instances,each comprising a strictly formatted crease pattern (CP diagram), the Compiled Flat Pattern, the complete Folding Process, and the final Folded Shape Image. We propose four evaluation tasks: Pattern Prediction, Multi-step Spatial Reasoning, Spatial Relationship Prediction, and End-to-End CP Code Generation. For the CP code generation task, we design an interactive environment and explore the possibility of using reinforcement learning methods to train MLLMs. Through experiments on existing MLLMs, we initially reveal the strengths and weaknesses of these models in handling complex spatial reasoning tasks.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2511.1845

Genre: Research Report > Experimental Study (1.00)

Industry:

Education (0.68)
Transportation > Passenger (0.45)
Transportation > Ground > Road (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models

Yuan, Botai, Zhou, Yutian, Wang, Yingjie, Huo, Fushuo, Jing, Yongcheng, Shen, Li, Wei, Ying, Shen, Zhiqi, Liu, Ziwei, Zhang, Tianwei, Yang, Jie, Tao, Dacheng

arXiv.org Artificial IntelligenceSep-25-2025

Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information -- in high-stakes clinical settings. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs. It contains 2,122 images across 18 departments and 20 modalities with 90 prompts that simulate biased inputs from patients, medical students, and physicians. We evaluate medical-specific, open-source, and proprietary LVLMs. All exhibit substantial sycophancy; the best proprietary model (Claude 3.7 Sonnet) still shows 45.98% sycophancy, and GPT-4.1 reaches 59.15%. Many medical-specific models exceed 95% sycophancy despite only moderate accuracy. Fine-grained analyses by bias type, department, perceptual granularity, and modality identify factors that increase susceptibility. We further show that higher data quality/diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy. EchoBench also serves as a testbed for mitigation: simple prompt-level interventions (negative prompting, one-shot, few-shot) produce consistent reductions and motivate training- and decoding-time strategies. Our findings highlight the need for robust evaluation beyond accuracy and provide actionable guidance toward safer, more trustworthy medical LVLMs.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.20146

Country: Asia (0.67)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Consumer Health (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Johnny: Structuring Representation Space to Enhance Machine Abstract Reasoning Ability

Song, Ruizhuo, Yuan, Beiming

arXiv.org Artificial IntelligenceJun-4-2025

--This paper thoroughly investigates the challenges of enhancing AI's abstract reasoning capabilities, with a particular focus on Raven's Progressive Matrices (RPM) tasks involving complex human-like concepts. Firstly, it dissects the empirical reality that traditional end-to-end RPM-solving models heavily rely on option pool configurations, highlighting that this dependency constrains the model's reasoning capabilities. T o address this limitation, the paper proposes the Johnny architecture - a novel representation space-based framework for RPM-solving. Through the synergistic operation of its Representation Extraction Module and Reasoning Module, Johnny significantly enhances reasoning performance by supplementing primitive negative option configurations with a learned representation space. Furthermore, to strengthen the model's capacity for capturing positional relationships among local features, the paper introduces the Spin-Transformer network architecture, accompanied by a lightweight Straw Spin-Transformer variant that reduces computational overhead through parameter sharing and attention mechanism optimization. Experimental evaluations demonstrate that both Johnny and Spin-Transformer achieve superior performance on RPM tasks, offering innovative methodologies for advancing AI's abstract reasoning capabilities.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2506.0197

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Add feedback

Option-ID Based Elimination For Multiple Choice Questions

Zhu, Zhenhao, Liu, Bulou, Ai, Qingyao, Liu, Yiqun

arXiv.org Artificial IntelligenceFeb-15-2025

Multiple choice questions (MCQs) are a popular and important task for evaluating large language models (LLMs). Based on common strategies people use when answering MCQs, the process of elimination (PoE) has been proposed as an effective problem-solving method. Existing methods to the PoE generally fall into two categories: one involves having the LLM directly select the incorrect options, while the other involves scoring the options. However, both methods incur high computational costs and often perform worse than methods that directly answer the MCQs with the option IDs. To address this issue, this paper proposes a PoE based on option ID. Specifically, our method eliminates option by selecting the option ID with the lowest probability. We conduct experiments with 10 different LLMs in zero-shot settings on 7 publicly available datasets. The experimental results demonstrate that our method significantly improves the LLM's performance. Further analysis reveals that the sequential elimination strategy can effectively enhance the LLM's reasoning ability. Additionally, we find that sequential elimination is also applicable to few-shot settings and can be combined with debias methods to further improve LLM's performance.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2501.15175

Country:

Europe (0.28)
Asia (0.28)

Genre: Research Report > New Finding (0.88)

Industry: Education (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction

Lee, Yooseop, Kim, Suin, Jo, Yohan

arXiv.org Artificial IntelligenceJan-21-2025

In designing multiple-choice questions (MCQs) in education, creating plausible distractors is crucial for identifying students' misconceptions and gaps in knowledge and accurately assessing their understanding. However, prior studies on distractor generation have not paid sufficient attention to enhancing the difficulty of distractors, resulting in reduced effectiveness of MCQs. This study presents a pipeline for training a model to generate distractors that are more likely to be selected by students. First, we train a pairwise ranker to reason about students' misconceptions and assess the relative plausibility of two distractors. Using this model, we create a dataset of pairwise distractor ranks and then train a distractor generator via Direct Preference Optimization (DPO) to generate more plausible distractors. Experiments on computer science subjects (Python, DB, MLDL) demonstrate that our pairwise ranker effectively identifies students' potential misunderstandings and achieves ranking accuracy comparable to human experts. Furthermore, our distractor generator outperforms several baselines in generating plausible distractors and produces questions with a higher item discrimination index (DI).

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2501.13125

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)
(11 more...)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > New Finding (0.46)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Thought-Path Contrastive Learning via Premise-Oriented Data Augmentation for Logical Reading Comprehension

Wang, Chenxu, Jian, Ping, Yang, Zhen

arXiv.org Artificial IntelligenceSep-24-2024

Logical reading comprehension is a challenging task that entails grasping the underlying semantics of text and applying reasoning to deduce the correct answer. Prior researches have primarily focused on enhancing logical reasoning capabilities through Chain-of-Thought (CoT) or data augmentation. However, previous work constructing chain-of-thought rationales concentrates solely on analyzing correct options, neglecting the incorrect alternatives. Addtionally, earlier efforts on data augmentation by altering contexts rely on rule-based methods, which result in generated contexts that lack diversity and coherence. To address these issues, we propose a Premise-Oriented Data Augmentation (PODA) framework. This framework can generate CoT rationales including analyses for both correct and incorrect options, while constructing diverse and high-quality counterfactual contexts from incorrect candidate options. We integrate summarizing premises and identifying premises for each option into rationales. Subsequently, we employ multi-step prompts with identified premises to construct counterfactual context. To facilitate the model's capabilities to better differentiate the reasoning process associated with each option, we introduce a novel thought-path contrastive learning method that compares reasoning paths between the original and counterfactual samples. Experimental results on three representative LLMs demonstrate that our method can improve the baselines substantially across two challenging logical reasoning benchmarks (ReClor and LogiQA 2.0). The data and code are released at https://github.com/lalalamdbf/TPReasoner.

energy expenditure, experiment, rationale, (15 more...)

arXiv.org Artificial Intelligence

2409.14495

Country:

Asia > China > Beijing > Beijing (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Industry: Education > Assessment & Standards > Student Performance (0.60)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)

Add feedback

When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options

Góral, Gracjan, Wiśnios, Emilia

arXiv.org Artificial IntelligenceAug-27-2024

This paper examines the zero-shot ability of Large Language Models (LLMs) to detect multiple-choice questions with no correct answer, a crucial aspect of educational assessment quality. We explore this ability not only as a measure of subject matter knowledge but also as an indicator of critical thinking within LLMs. Our experiments, utilizing a range of LLMs on diverse questions, highlight the significant performance gap between questions with a single correct answer and those without. Llama-3.1-405B stands out by successfully identifying the lack of a valid answer in many instances. These findings suggest that LLMs should prioritize critical thinking over blind instruction following and caution against their use in educational settings where questions with incorrect answers might lead to inaccurate evaluations. This research sets a benchmark for assessing critical thinking in LLMs and emphasizes the need for ongoing model alignment to ensure genuine user comprehension and assistance.

language model, wang, zhang, (17 more...)

arXiv.org Artificial Intelligence

2409.00113

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Poland > Masovia Province > Warsaw (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
(11 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Educational Technology > Educational Software (0.67)
Education > Educational Setting (0.48)
Education > Assessment & Standards (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models

Zeng, Zhongshen, Liu, Yinhong, Wan, Yingjia, Li, Jingyao, Chen, Pengguang, Dai, Jianbo, Yao, Yuxuan, Xu, Rongwu, Qi, Zehan, Zhao, Wanru, Shen, Linling, Lu, Jianqiao, Tan, Haochen, Chen, Yukang, Zhang, Hao, Shi, Zhan, Wang, Bailin, Guo, Zhijiang, Jia, Jiaya

arXiv.org Artificial IntelligenceJun-19-2024

Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely based on the step-by-step chain-of-thought reasoning processes. However, it has been increasingly challenging to evaluate the reasoning capability of LLMs. Concretely, existing outcome-based benchmarks begin to saturate and become less sufficient to monitor the progress. To this end, we present a process-based benchmark MR-BEN that demands a meta reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. MR-BEN is a comprehensive benchmark comprising 5,975 questions collected from human experts, covering various subjects such as physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). For example, open-source models are seemingly comparable to GPT-4 on outcome-based benchmarks, but they lag far behind on our benchmark, revealing the underlying reasoning capability gap between them. Our dataset and codes are available on https://randolph-zeng.github.io/Mr-Ben.github.io/.

incorrect option, kinetic energy, step-by-step solution, (14 more...)

arXiv.org Artificial Intelligence

2406.13975

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > Jordan (0.04)
(13 more...)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.45)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine (1.00)
Education > Curriculum > Subject-Specific Education (1.00)
Materials > Chemicals (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Add feedback